AITopics | robust speech recognition

Collaborating Authors

robust speech recognition

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results

Nguyen, Thai-Binh, Zmolikova, Katerina, Ma, Pingchuan, Pham, Ngoc Quan, Fuegen, Christian, Waibel, Alexander

arXiv.org Artificial IntelligenceOct-28-2025

We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial 50% improvements, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for the MCoRec.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.23276

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.50)

Add feedback

Towards Robust Speech Recognition for Jamaican Patois Music Transcription

Madden, Jordan, Stone, Matthew, Johnson, Dimitri, Geddez, Daniel

arXiv.org Artificial IntelligenceJul-24-2025

Although Jamaican Patois is a widely spoken language, current speech recognition systems perform poorly on Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications. In this work, we take a data-centric approach to this problem by curating more than 40 hours of manually transcribed Patois music. We use this dataset to fine-tune state-of-the-art automatic speech recognition (ASR) models, and use the results to develop scaling laws for the performance of Whisper models on Jamaican Patois audio. We hope that this work will have a positive impact on the accessibility of Jamaican Patois music and the future of Jamaican Patois language modeling.

artificial intelligence, machine learning, whisper model, (15 more...)

arXiv.org Artificial Intelligence

2507.16834

Country: North America > Jamaica (0.15)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Kim, Sungnyun, Jang, Kangwook, Bae, Sangmin, Cho, Sungwoo, Yun, Se-Young

arXiv.org Artificial IntelligenceFeb-11-2025

Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.

artificial intelligence, hierarchical audio-visual expert, robust speech recognition, (1 more...)

arXiv.org Artificial Intelligence

2502.10447

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

Neural Information Processing SystemsApr-6-2023, 17:08:49 GMT

For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which rep(cid:173) resent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modula(cid:173) tion frequencies in higher stages of the mammalian auditory sys(cid:173) tem. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppres(cid:173) sion is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speaker(cid:173) independent digit recognition experiments and compared to noise suppression by Spectral Subtraction.

artificial intelligence, machine learning, neurophysiologically-motivated snr estimation, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.30)

Add feedback

Combining ICA and Top-Down Attention for Robust Speech Recognition

Neural Information Processing SystemsApr-6-2023, 17:08:10 GMT

We present an algorithm which compensates for the mismatches between characteristics of real-world problems and assumptions of independent component analysis algorithm. To provide additional information to the ICA network, we incorporate top-down selec(cid:173) tive attention. An MLP classifier is added to the separated signal channel and the error of the classifier is backpropagated to the ICA network. This backpropagation process results in estimation of expected ICA output signal for the top-down attention. Then, the unmixing matrix is retrained according to a new cost function representing the backpropagated error as well as independence.

algorithm, combining ica and top-down attention, robust speech recognition, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.47)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.40)

Add feedback

ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition

Neural Information Processing SystemsApr-6-2023, 16:46:54 GMT

A challenging, unsolved problem in the speech recognition com(cid:173) munity is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recog(cid:173) nition is to automatically remove the noise from the cepstrum se(cid:173) quence before feeding it in to a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise(cid:173) free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for prob(cid:173) abilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech.

cid, learning dynamic noise model, robust speech recognition, (5 more...)

Neural Information Processing Systems

Country: North America > United States > New York > New York County > New York City (0.07)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.96)

Add feedback

Audio-Visual Efficient Conformer for Robust Speech Recognition

Burchi, Maxime, Timofte, Radu

arXiv.org Artificial IntelligenceJan-4-2023

End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large scale hand-labeled datasets and sufficient computing resources made it possible to train powerful deep neural networks, reaching very low Word Error Rate (WER) on academic benchmarks. However, despite impressive performance on clean audio samples, a drop of performance is often observed on noisy speech. In this work, we propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification (CTC)-based architecture by processing both audio and visual modalities. We improve previous lip reading methods using an Efficient Conformer back-end on top of a ResNet-18 visual front-end and by adding intermediate CTC losses between blocks. We condition intermediate block features on early predictions using Inter CTC residual modules to relax the conditional independence assumption of CTC-based models. We also replace the Efficient Conformer grouped attention by a more efficient and simpler attention mechanism that we call patch attention. We experiment with publicly available Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3) datasets. Our experiments show that using audio and visual modalities allows to better recognize speech in the presence of environmental noise and significantly accelerate training, reaching lower WER with 4 times less training steps. Our Audio-Visual Efficient Conformer (AVEC) model achieves state-of-the-art performance, reaching WER of 2.3% and 1.8% on LRS2 and LRS3 test sets. Code and pretrained models are available at https://github.com/burchim/AVEC.

artificial intelligence, audio-visual efficient conformer, machine learning, (1 more...)

arXiv.org Artificial Intelligence

2301.01456

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.73)

Add feedback

Robust Speech Recognition via Large-Scale Weak Supervision

Radford, Alec, Kim, Jong Wook, Xu, Tao, Brockman, Greg, McLeavey, Christine, Sutskever, Ilya

arXiv.org Artificial IntelligenceDec-6-2022

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2212.04356

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Myanmar (0.04)
South America > Brazil > Paraná > Curitiba (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Speech-enhanced and Noise-aware Networks for Robust Speech Recognition

Lee, Hung-Shin, Chen, Pin-Yuan, Cheng, Yao-Fei, Tsao, Yu, Wang, Hsin-Min

arXiv.org Artificial IntelligenceNov-22-2022

Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is composed of a multi-task autoencoder, where noisy speech is decomposed into clean speech and noise. By concatenating its enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame into a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task.

artificial intelligence, machine learning, speech-enhanced and noise-aware network, (1 more...)

arXiv.org Artificial Intelligence

2203.13696

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.87)

Add feedback

Optimized Power Normalized Cepstral Coefficients towards Robust Deep Speaker Verification

Liu, Xuechen, Sahidullah, Md, Kinnunen, Tomi

arXiv.org Artificial IntelligenceSep-24-2021

After their introduction to robust speech recognition, power normalized cepstral coefficient (PNCC) features were successfully adopted to other tasks, including speaker verification. However, as a feature extractor with long-term operations on the power spectrogram, its temporal processing and amplitude scaling steps dedicated on environmental compensation may be redundant. Further, they might suppress intrinsic speaker variations that are useful for speaker verification based on deep neural networks (DNN). Therefore, in this study, we revisit and optimize PNCCs by ablating its medium-time processor and by introducing channel energy normalization. Experimental results with a DNN-based speaker verification system indicate substantial improvement over baseline PNCCs on both in-domain and cross-domain scenarios, reflected by relatively 5.8% and 61.2% maximum lower equal error rate on VoxCeleb1 and VoxMovies, respectively.

normalization, pncc, recognition, (14 more...)

arXiv.org Artificial Intelligence

2109.12058

Country:

Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.14)
Asia > China (0.05)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback